cd "C:\Users\remij\Desktop\Ghana Missions 06242021 JOEG JEBO\Data_Analysis\Census 2000 New Data"

* This is the database we use for the 10% census. 
* The file "pop_all1" was created by Jedwab and Moradi 2016 (REStat). 
* Remi Jedwab & Alexander Moradi, 2016. "The Permanent Effects of Transportation Revolutions in Poor Countries: Evidence from Africa," The Review of Economics and Statistics, MIT Press, vol. 98(2), pages 268-284, May.
* See their replication files for details on the raw census data. 

* The data is at the individual level (10% census). 
* We know the enumeration area (EA) of each individual. 
* While there are many EAs, they are not all fully contained within a single grid cell. We thus assign individuals "probabilistically" to different grid cells.
* Using ArcGIS and GIS files of the EA boundaries and grid cell boundaries, we "cut" the individuals into different grid cells. 
* Below, we use "prop_in_grid", which represents the probability that the individual belongs to the cell. 

use pop_all1, clear
* The data is at the individual-grid cell level. 
count
* 2,434,253
codebook gridcell 
* 2,079
sum prop_in_grid, d
* The same individual can be entirely in one cell (in which case the weight is equal to 1) or split across several cells (in which case each weight is below 1).
* If we multiply the number of observations by the mean weight, we obtain the total number of individuals = 2,434,253*0.7742199 = 1,884,647. This corresponds to 10% of the total census population in 2000.
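* A minimal check of the total above (a sketch, not part of the original workflow):
* summarize stores the sum of the variable in r(sum), which should be close to 1,884,647.
quietly sum prop_in_grid
display %12.0fc r(sum)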
order id gridcell line_number prop_in_grid sex age 
sort id gridcell line_number 
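* Sketch of a consistency check on the weights. It ASSUMES that id and line_number together
* identify an individual (not confirmed by the codebook); wsum_check is a hypothetical helper name.
* If prop_in_grid is a probability, the weights of an individual should not sum to more than 1 (up to rounding).
bysort id line_number: egen double wsum_check = total(prop_in_grid)
sum wsum_check, d
drop wsum_check
* bysort changed the sort order, so we restore it.
sort id gridcell line_number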

* We now create the variables we need. 

*** AGE ***

gen ageb4 = (age <= 4) if age != .
gen age15 = (age >= 15) if age != .
gen age18 = (age >= 18) if age != .
gen age25 = (age >= 25) if age != .

*** RELIGION ***

gen rel_cath = (religion == 1)
gen rel_prot = (religion == 2)
gen rel_chri = (rel_cath == 1 | rel_prot == 1)
gen rel_chri2 = (rel_cath == 1 | rel_prot == 1 | religion == 3 | religion == 4)
gen rel_pent = (religion == 3)
gen rel_otchri = (religion == 4)
gen rel_prot2 = (rel_prot == 1 | religion == 3 | religion == 4)
gen rel_isla = (religion == 5)
gen rel_trad = (religion == 6)

*** LITERACY (FOR 15 YEARS OR OLDER) ***

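* Literate = 1 if literacy >= 2. The variable is only defined for individuals aged 15 or older.
* It is set to missing when age is missing, consistent with the age15 denominator used below.
gen lit_all = (literacy >= 2 & literacy != .) if age >= 15 & age != .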
gen lit_all_m = lit_all if sex == 1
gen lit_all_f = lit_all if sex == 2

*** IN AGRICULTURE ***

gen job = (industry >= 1 & industry <= 99)
gen agri = (industry >= 1 & industry <= 5)
replace agri = . if job == 0
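* A quick cross-tabulation (sketch) to confirm the coding: agri should be missing only when job == 0.
tab job agri, m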

*** LITERACY OF AGRICULTURAL WORKERS ***

gen agrilit = agri*lit_all
gen agriage15 = (age >= 15 & agri == 1)
replace agriage15 = . if agri == . | age == .

*** WATER, TOILET ***

tab urbrur
gen rural = (urbrur == 2)
gen improvsani_rural = ((toilet == 1 | toilet == 2 | toilet == 3) & rural == 1)
tab improvsani_rural rural, m
gen improvwater_rural = ((water == 1 | water == 2 | water == 3 | water == 4) & rural == 1)
tab improvwater_rural rural, m
gen improvany_rural = (improvsani_rural == 1 | improvwater_rural == 1)
gen improvboth_rural = (improvsani_rural == 1 & improvwater_rural == 1)

*** OCCUPATION ***

gen occup_yn = (occupation >= 1 & occupation <= 100)
gen proftech = (occupation >= 1 & occupation <= 19)
gen adminmanag = (occupation >= 20 & occupation <= 29)
gen clerk = (occupation >= 30 & occupation <= 39)
gen sales = (occupation >= 40 & occupation <= 49)
gen servwork = (occupation >= 50 & occupation <= 59)
gen agriwork = (occupation >= 60 & occupation <= 69)
gen prodwork = (occupation >= 70 & occupation <= 79)
gen otherwork = (occupation >= 80 & occupation <= 100)
foreach X in proftech adminmanag clerk sales servwork agriwork prodwork otherwork {
replace `X' = . if occup_yn != 1
}

*** CHILD MORTALITY VARIABLES ***

gen femage1549 = (sex == 2 & age >= 15 & age <= 49)
gen ceb1549 = ceb if femage1549 == 1
gen cs1549 = cs if femage1549 == 1

* boys vs girls 
gen cebboys1549 = cebboys if femage1549 == 1
gen csmale1549 = csmale if femage1549 == 1
gen cebgirls1549 = cebgirls if femage1549 == 1
gen csfemale1549 = csfemale if femage1549 == 1

* for rural women only
gen ceb1549_rural = ceb if femage1549 == 1 & rural == 1
gen cs1549_rural = cs if femage1549 == 1 & rural == 1

*** CREATION OF THE DATA AT THE CELL LEVEL ***

* We keep the cells with a (weighted) population of at least 10 individuals
bysort gridcell: egen cell_pop = total(prop_in_grid)
drop if cell_pop < 10
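
* Sanity check (sketch): the description of prop_in_grid above implies that it is non-missing
* and lies in (0, 1]. If that holds, this count is 0.
count if missing(prop_in_grid) | prop_in_grid <= 0 | prop_in_grid > 1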

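* We weight each variable by the probability that the individual belongs to the cell.
* age18 is included so that every variable summed in the collapse below is weighted.
foreach X of varlist birth12m ageb4 age15 age18 age25 rel*_* lit_* femage* ceb* cs* agri agrilit agriage15 occup_yn proftech adminmanag clerk sales servwork agriwork prodwork otherwork improv* rural {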
replace `X' = `X' * prop_in_grid 
}

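* We obtain the weighted "sums" at the cell level. 
* Example: a literate 20-year-old with prop_in_grid = 0.25 adds 0.25 to both lit_all and age15 in that cell.
* Below we divide by the relevant population to obtain the means when necessary.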
collapse (sum) prop_in_grid birth12m ageb4 age15 age18 age25 rel*_* lit_* femage* ceb* cs* agri agrilit agriage15 occup_yn proftech adminmanag clerk sales servwork agriwork prodwork otherwork improv* rural, by(gridcell)
drop if gridcell == ""
ren prop_in_grid population
* Total (weighted) population of the cell in the 10% sample
codebook gridcell
* We have 1,895 cells with enough observations. 
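* Optional check (sketch): after the collapse, gridcell should uniquely identify the observations.
* isid exits with an error if it does not.
isid gridcell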

*** CLEANING THE GRID CELL DATA ***

* We now further clean some of the variables or create new variables.

** Religion **

foreach X of varlist rel_*  {
replace `X' = `X'/population*100
sum `X', d
}
label var rel_chri "Share of Catholics + Protestants in total population (%)"
label var rel_chri2 "Share of Christians (incl. Pentecostal & Other) in total population (%)"
label var rel_prot "Share of Protestants in total population (%)"
label var rel_cath "Share of Catholics in total population (%)"
label var rel_otchri "Share of Other Christians in total population (%)"
label var rel_pent "Share of Pentecostals/Charismatics in total population (%)"
label var rel_prot2 "Share of Protestants (incl. Pentecostal & Other) in total population (%)"
label var rel_isla "Share of Muslims in total population (%)"
label var rel_trad "Share of Traditional Religions in total population (%)"

** Literacy **

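* Note: lit_all, lit_all_m and lit_all_f are all divided by the total 15+ population (both sexes).
* lit_all_m and lit_all_f are thus shares of that population, not sex-specific literacy rates.
foreach X of varlist lit_* {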
replace `X' = `X'/age15*100
sum `X', d
}
drop age15
label var lit_all "Literacy rate (any language) in 15+ population (%)"

** Literacy of farmers and males and females **

gen lit_agri = agrilit/agriage15*100
sum lit_agri, d
sum lit_all, d
drop agrilit agriage15
label var lit_agri "Literacy rate (any language) in 15+ pop. & agricultural workers (%)"
label var lit_all_m "Literate males (15+) as share of total 15+ population (%)"
label var lit_all_f "Literate females (15+) as share of total 15+ population (%)"

** Improved water/sanitation **

gen impsani_rural = improvsani_rural/rural*100
sum impsani_rural, d
drop improvsani_rural
gen impwater_rural = improvwater_rural/rural*100
sum impwater_rural, d
drop improvwater_rural

gen impany_rural = improvany_rural/rural*100
sum impany_rural, d
drop improvany_rural
gen impboth_rural = improvboth_rural/rural*100
sum impboth_rural, d
drop improvboth_rural

drop impsani_rural impwater_rural
label var impany_rural "Improved sanitation fac. or improved water source - rural (%)"
label var impboth_rural "Improved sanitation fac. and improved water source - rural (%)"

** Occupation **

foreach X in proftech adminmanag clerk sales servwork agriwork prodwork otherwork {
replace `X' = `X'/occup_yn*100
sum `X'
label var `X' "Share of `X' in pop of occup workers (%)"
}
* The occupation shares should sum to 100 in each cell (rowtotal replaces the outdated rsum).
egen test = rowtotal(proftech adminmanag clerk sales servwork agriwork prodwork otherwork)
sum test, d
* ok
drop test

foreach X in proftech adminmanag clerk sales servwork agriwork prodwork otherwork {
ren `X' `X'_sh
}
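* We create the skilled occupation shares. 
* We use the full "_sh" names created by the rename above rather than relying on variable abbreviation.
gen cogn_st_sh = proftech_sh + adminmanag_sh
gen cogn_br_sh = proftech_sh + adminmanag_sh + clerk_sh
drop proftech_sh adminmanag_sh clerk_sh sales_sh servwork_sh agriwork_sh prodwork_sh otherwork_sh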
label var cogn_st_sh "Skilled occup. sh. - strict - (admin., manag., prof., tech.)"
label var cogn_br_sh "Skilled occup. sh. - broad - (also incl. clerk)"

** Child mortality **

label var ceb1549 "Number of children ever born (15-49 women)"
label var cs1549 "Number of children who died (15-49 women)"
label var cebboys1549 "Number of boys ever born (15-49 women)"
label var cebgirls1549 "Number of girls ever born (15-49 women)"
label var csmale1549 "Number of boys who died (15-49 women)"
label var csfemale1549 "Number of girls who died (15-49 women)"
label var ceb1549_rural "Number of children ever born (15-49 women, rural)"
label var cs1549_rural "Number of children who died (15-49 women, rural)"

** Population **

ren population census_count
label var census_count "Weighted number of individuals in the 10% census sample"

** We drop the variables we do not need **

drop occup_yn agri age* birth* femage* ceb cebboys cebgirls csmale csfemale cs rural 

** We save **

sort gridcell
save census2000_gridlevel_jebo, replace

